Abstract: In many mobile visual analysis applications, compressed
video is transmitted over a communication network and analyzed by a
server. Typical processing steps performed at the server include
keypoint detection, descriptor calculation, and feature matching.
Video compression has been shown to have an adverse effect on
feature-matching performance. The negative impact of compression can
be reduced by using the keypoints extracted from the uncompressed
video to calculate descriptors from the compressed video. Based on
this observation, we propose to provide these keypoints to the server
as side information and to extract only the descriptors from the
compressed video. First, we introduce four different frame types for
keypoint encoding to address different types of changes in video
content. These frame types represent a new scene, the same scene, a
slowly changing scene, or a rapidly moving scene and are determined
by comparing features between successive video frames. Then, we
propose Intra, Skip, and Inter modes of encoding the keypoints for
different frame types. For example, keypoints for new scenes are
encoded using the Intra mode, and keypoints for unchanged scenes are
skipped. As a result, the bitrate of the side information related to
keypoint encoding is significantly reduced. Finally, we present
pairwise matching and image retrieval experiments conducted to
evaluate the performance of the proposed approach using the Stanford
mobile augmented reality dataset and 720p format videos. The results
show that the proposed approach offers significantly improved feature
matching and image retrieval performance at a given bitrate.

Index Terms: coding, H.265/HEVC, SIFT, keypoints, matching,
prediction, retrieval.

I. Introduction 

The extraction of features from images or videos is a fundamental
component of many computer vision algorithms. Ideally, the feature
extraction process identifies features that are shift-invariant,
scale-invariant, rotation-invariant, illumination-invariant, etc. The
extracted features are frequently compared with features in a
database to identify correspondences. Typically, in the feature
extraction process, keypoints (also called interest points or salient
points) are detected first, and then local descriptors (also called
feature vectors) are calculated from the image patches located around
these keypoints. With the increasing ubiquity of camera-equipped
mobile devices and high-speed wireless communication networks, novel
applications such as mobile visual search are emerging. In such
applications, most of the feature-related processing is typically
performed at a server. Note that feature extraction can be performed
by either the client or the server. For client-side feature
extraction, compressed features are uploaded to the server. The
compression of the features is expected to create only small
distortions compared with the uncompressed features. For server-side
feature extraction, the images/videos are compressed and transmitted
to the server. In this case, it is important to minimize the impact
of the compression on the feature extraction performed at the server.
Ideally, the features extracted from a compressed image or video
should be identical, or at least very similar, to the features
extracted from the uncompressed version. Both client- and
server-based feature extraction approaches fall into the emerging
area of feature-related compression approaches.

Previous studies of feature-related compression can be categorized
into three classes. The first class involves direct compression of
the features. For example, several studies Q- 0 have proposed methods
of encoding scale-invariant feature transform (SIFT) features Q
extracted from images and video sequences. Other studies 0-171 have
proposed the encoding of binary features (e.g., BRISK features Q),
and algorithms of this type follow the analyze-then-compress (ATC)
paradigm. Along the same lines, the compact descriptors for visual
search (CDVS) 19l- im standard aims to standardize technologies for
feature encoding at low bitrates. In the second class, canonical
image patches are compressed and then transmitted or stored for
further processing |j^, p^-p^. The third class is the standard
image-compression-based architecture, which the authors of 0 have
dubbed the compress-then-analyze (CTA) paradigm. Furthermore, a few
approaches involve modifying standard image/video compression
algorithms such that the features extracted from the compressed
images/videos are as similar as possible to the features extracted
from the uncompressed images/videos. These approaches (also belonging
to the third class) are referred to as feature-preserving image and
video compression tm-iD-. To this end, the authors of [16] optimize
the JPEG quantization table, whereas in [18], the rate allocation
strategy is modified to preserve the most important features.
Recently, GD has proposed encoding the SIFT keypoints and
transmitting them as side information along with the compressed
image. Similarly, [20] proposed transmitting the encoded BRISK
keypoint locations, scales, and differential BRISK descriptors along
with the image to a server. In both approaches, the keypoints are
sent as side information for improved feature extraction from
compressed images. Compared with the first two classes, the
advantages of feature-preserving image/video compression are that the
decoded images/videos can also be viewed and stored for future use
and that other types of features can later be extracted. For
solutions providing standard-compatible images/videos, such as
GD -1^, standard decoders can be used to decode the images/videos. In
all studies mentioned above, the objective was to achieve low-bitrate
data transmission for applications such as mobile visual search,
video surveillance, and visual localization.

Few studies have addressed the compression of features extracted from
videos. In Q, 0, the authors proposed intra- and inter-frame coding
modes for SIFT descriptors and binary local descriptors,
respectively, to encode the descriptors extracted from a video
sequence. The authors of 0 proposed inter-frame predictive coding
techniques for image patches and keypoint locations, and they
subsequently proposed an inter-descriptor coding scheme Q to encode
the descriptors extracted from such patches. In these approaches,
descriptors/patches are extracted from videos and compressed for
transmission; however, the videos themselves are not stored or sent
to a server. By contrast, in this work, we transmit compressed videos
via a communication network to a server, which performs the feature
extraction. The advantages previously discussed with respect to
images also apply to videos. Based on our previous study regarding
images 0, we propose in this work a predictive keypoint encoding
approach to encode the original keypoints extracted from uncompressed
videos. In the proposed approach, the compressed keypoints are sent
as side information along with the compressed videos at low bitrates.
At the server, we use the decoded keypoints (locations, scales, and
orientations) to extract feature descriptors from the videos. The
proposed approach is illustrated in Fig. 1. The proposed framework is
fundamentally different from those reported in other studies 0. 0.,
which encode and transmit the descriptors. By contrast, in the
proposed approach, only the keypoints are encoded and transmitted
along with the compressed video. To evaluate the proposed approach,
we conduct experiments using the widely used H.264/AVC and H.265/HEVC
standards for video encoding. The results are presented as plots that
show the number of successfully matched features versus the bitrate.
In addition, we show the percentages of images that are successfully
retrieved using various approaches via a content-based image
retrieval engine.

The remainder of this paper is organized as follows. 
In Section II, the impact of video compression on feature 
quality is examined. The feature-preservation performance 
of H.264/AVC and H.265/HEVC as a function of bitrate is 
presented, and the motivation for the proposed approach is 
explained in greater detail. In Section III, we introduce four 
different types of frames for keypoint encoding based on the 
changes in the video content and propose keypoint encoding 
approaches for each of these different frame types. Section 
IV presents the details of the proposed keypoint prediction 
framework. In Section V, we detail the Intra, Skip and Inter 
modes of keypoint encoding and transmission, which allow us 
to significantly reduce the bitrate of the side information. In 
Section VI, we present pairwise matching and image retrieval 
experiments. Our results show that the proposed approach 
offers substantially improved performance compared with that 
of standard H.265/HEVC-encoded videos. Conclusions are 
presented in Section VII. 

II. Impact of video compression on feature quality

In this section, we investigate the impact of video compression on
the features extracted from compressed videos. For this purpose, we
first extract SIFT features from videos encoded using H.264/AVC and
H.265/HEVC and compare these features with the features extracted
from a set of uncompressed reference images. Then, we plot the number
of matches as a function of bitrate and compare the results for the
different video compression standards.


A. Experimental design 

For the matching performance evaluation, we choose the Stanford
mobile augmented reality (MAR) dataset [14], which comprises 23
videos (each containing a single static object) and 23 corresponding
reference images. Each video consists of 100 frames (30 fps) at a
resolution of 640 × 480. Similar to [14], we use eight video
sequences (OpenCV, Wang Book, Barry White, Janet Jackson, Monsters
Inc., Titanic, Glade, and Polish) for pairwise feature-matching
evaluations. Furthermore, we use SIFT features and the VLFeat [21]
SIFT implementation. Similar to 0. the top 200 features are selected
for each frame in accordance with the CDVS Test Model. The
nearest-neighbor distance ratio (NNDR) is used to evaluate matching
descriptors. For a query descriptor $D$ extracted from a compressed
test frame, the nearest descriptor $D_a$ and the second nearest
descriptor $D_b$ from the reference image are found, and thresholding
is applied to their distance ratio. The query descriptor $D$ and the
nearest descriptor $D_a$ are considered to match if
$\|D - D_a\| / \|D - D_b\| < t$. We calculate the Euclidean distances
between the descriptors and set $t$ to 0.8 jffl. Then, we use random
sample consensus (RANSAC) pij to remove incorrectly matched features,
assuming an affine transformation between the reference image and the
test frame.
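
The NNDR test described above can be sketched in a few lines of
Python. This is only an illustration under simplifying assumptions: it
is not the VLFeat code used in the experiments, descriptors are plain
lists of floats, the helper names `euclidean` and `nndr_match` are
hypothetical, and the subsequent RANSAC verification is omitted.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nndr_match(query, references, t=0.8):
    """Return the index of the nearest reference descriptor if the
    nearest-neighbor distance ratio is below t, otherwise None.

    query      -- descriptor D from the compressed test frame
    references -- descriptors from the reference image
    t          -- ratio threshold (0.8 in the text)
    """
    if len(references) < 2:
        return None  # the ratio needs a nearest and a second nearest
    ranked = sorted((euclidean(query, r), i) for i, r in enumerate(references))
    (d_a, idx_a), (d_b, _) = ranked[0], ranked[1]
    if d_b == 0.0:
        return None  # two identical candidates: the match is ambiguous
    return idx_a if d_a / d_b < t else None
```

A query is accepted only when its nearest neighbor is clearly closer
than the runner-up; a query that is almost equally close to two
reference descriptors is rejected.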


B. Feature-matching performance for different video compression
standards

The eight test video sequences are encoded using the JM reference
software [23] and the HEVC Test Model [24]. To be able to compare our
results with the patch-encoding approaches presented in 0^0, we use
the same parameters as in 0 for H.264/AVC. Additionally, we use more
QP values to produce high-quality videos. The settings are as
follows: the IPPP... structure, IntraPeriod = 50 frames,
QP_P-frames = {26, 30, 34, 38, 42, 46, 50}, and
QP_I-frames = QP_P-frames - 3. The parameters used for H.265/HEVC
encoding are the same as those in the example provided in the HEVC
version 16.0 manual | |^. The GOP structure is IBBBPBBBP..., and
QP = {22, 26, 30, 34, 38, 42, 46, 50}.

In the following experiments, SIFT features are extracted from the
compressed frames and compared with the SIFT features extracted from
the corresponding uncompressed reference images. The number of
matching features after RANSAC is applied and the bitrates for the
encoded videos are averaged over all test frames. The solid red and
green curves in Fig. 2 represent the feature-matching performance for
H.264/AVC- and H.265/HEVC-encoded videos as a function of bitrate.
The H.265/HEVC-encoded videos exhibit much better performance than
the videos encoded with H.264/AVC in terms of the number of feature
matches at a given bitrate.

Fig. 1. Overview of the proposed approach. The keypoints extracted from the video content at the client are encoded and transmitted as side information
along with the compressed video to the server.



Fig. 2. SIFT-feature-matching performance for different video compression 
schemes. The dotted line shows the results obtained for descriptors calculated 
from the H.265/HEVC encoded videos using the original keypoints extracted 
from the uncompressed video frames. 

C. Sensitivity of keypoints and descriptors 

Given an input video frame $I$, the detected keypoints can be
expressed as

$$ k_i = (l_i, \sigma_i, \theta_i) \tag{1} $$

where $l_i$ is the location of keypoint $i$ in the frame, $\sigma_i$
is the scale, and $\theta_i$ is the orientation of the keypoint. The
descriptors obtained using the keypoints detected in frame $I$ are
expressed as follows:

$$ d_i = \Psi(I, k_i) \tag{2} $$

where $\Psi$ represents the descriptor extraction operation on $I$
using the keypoints $k_i$. Similarly, the descriptors extracted from
the compressed frame $\tilde{I}$ are expressed as

$$ \tilde{d}_i = \Psi(\tilde{I}, \tilde{k}_i) \tag{3} $$

Here, the $\tilde{k}_i$ are the keypoints detected in $\tilde{I}$.
Similar to our previous study on keypoint encoding for improved
feature extraction from compressed images 03’ we perform a simple
experiment to demonstrate that the results of keypoint detection can
be easily affected by video compression artifacts and that
descriptors are more robust. To this end, we calculate the
descriptors from the compressed video frames using the keypoints
extracted from the original (uncompressed) frames. This procedure is
expressed as follows:

$$ d'_i = \Psi(\tilde{I}, k_i) \tag{4} $$

This means that the features $d'_i$ have exactly the same keypoints
$k_i$ as in the uncompressed case. This allows us to ignore the
possibility of inaccurate keypoint detection from the compressed
frame and evaluate exclusively the descriptor robustness in the
presence of compression artifacts. Because the original keypoints
$k_i$ cannot be obtained from the compressed frame $\tilde{I}$, we
present the matching results obtained based on H.265/HEVC-encoded
videos as a dotted line (upper bound) in Fig. 2. For videos of higher
quality, the gap between the HEVC approach and the HEVC+ori. kpts.
approach in Fig. 2 becomes increasingly smaller. However,
high-quality videos also require a much higher bitrate and can be
transmitted only if the necessary network resources are available. To
avoid excessive bandwidth requirements for applications such as
mobile visual search or video surveillance, we must compress the data
to a low bitrate for transmission. This is the motivation for the
current research on feature compression algorithms in the literature
as well as the MPEG CDVS standard. Thus, our goal is not to provide
high-quality video for human observers. Instead, we are interested in
videos encoded at low bitrates for communication networks of limited
capacity. The results indicate that if the original keypoints are
preserved, then the feature-matching performance can be improved,
especially for strong compression (i.e., low-bitrate encoding). This
observation motivates us to encode the keypoints and send them as
side information along with such videos compressed to low bitrates.
In the following experiments, we will use QP = {38, 42, 46, 50} for
H.264/AVC and H.265/HEVC encoding.

III. Core ideas 

In our previous study HD’ we presented a keypoint encoding approach
for still images. Applying this approach directly to individual
frames in a video sequence would significantly increase the bitrate,
as will be discussed in Section V. To address this issue, similar to
the conventional inter-frame prediction scheme in video coding, we
propose several keypoint prediction approaches that significantly
reduce the number of keypoints to be encoded and thus the bitrate
required for the side information.

A. Frame types for keypoint encoding 

The locations, scales, and orientations of keypoints detected in
consecutive frames are related. If the keypoints are detected
independently for each frame, then some of the keypoints may
disappear or reappear across the frames as a result of the feature
detection process. However, a keypoint that has disappeared from the
previous frame may still yield a useful descriptor for the current
frame. To address this issue, the authors of IE proposed a temporally
coherent keypoint detector in which the detected patches are
propagated to consecutive video frames. Unlike their approach, we
extract keypoints and use them to predict the keypoints (locations,
scales, and orientations) in consecutive video frames because, in our
case, the video itself is also available. Here, we introduce four
different types of frames for keypoint encoding. These frame types
are illustrated in Fig. 3. Two of the four frame types are similar to
those proposed in llE: Detection frames (D-frames) and Skip frames
(S-frames). For D-frames, the keypoints are extracted using a
conventional feature detection process. For S-frames, all keypoints
are estimated using the keypoints from the previous frame, and the
direct encoding of the keypoints is skipped. Therefore, S-frames
enable a significant reduction in the amount of side information, as
discussed in the following section. After a certain number of frames,
it might no longer be possible to estimate the keypoints of the
current frame using the previously identified keypoints because, for
example, the object of interest is leaving the field of view or the
estimated keypoints do not yield effective descriptors calculated
from the current video frame. To address this case, we add a third
type of video frame, termed an Update frame (U-frame), to update the
keypoints. In contrast to D-frames, U-frames combine both
conventionally detected keypoints and forward-estimated keypoints
because a few of the estimated keypoints are still sufficient for
calculating the descriptors. However, we find that if the scene is
moving quickly, then even U-frames will generate a large number of
bits. To resolve this issue, we add a fourth type of frame, called a
Null frame (N-frame), for which keypoint encoding and transmission
are switched off. For N-frames, no side information is transmitted,
and the features are extracted exclusively from the compressed
frames. The method of determining the frame types and the keypoint
encoding processes for the different frame types are explained in the
following sections.

Fig. 3. Examples of the four types of frames for keypoint encoding.

Fig. 4. Keypoint encoding for video.

B. Keypoint encoder 

As previously described, we predict the keypoints for S- and U-frames
from the keypoints in the previous frame, as illustrated in Fig. 4.
Note that the keypoint decoder follows the reverse process. Below, we
define the terminology that will be used throughout this paper for
clarity.

• Estimate means to estimate the locations, scales, and orientations
of the keypoints in the current frame using keypoints from previous
frames.

• Update means to estimate the keypoints and also use differential
keypoint encoding to update the keypoints.

• Predict means to either estimate or update the keypoints. Note that
although estimate and predict are synonyms, they are used for
different purposes in this paper.

• In Intra mode, the keypoints are extracted by conventional means
from the current frame without reference to keypoints from previous
frames and are encoded using a process similar to that presented in
US-

• In Skip mode, the keypoints are estimated and then stored in the
buffer. Thus, differential keypoint encoding and transmission are
skipped.

• In Inter mode, the keypoints are estimated and the differential
encoder is used to update their locations, scales, and orientations.
Afterward, they are stored in the buffer.

Note that only the Intra mode is used to encode the keypoints for
D-frames; only the Skip mode is used for S-frames; all three modes
(the Intra, Skip, and Inter modes) are used for U-frames; and no
keypoint encoding is performed for N-frames. We will explain the
components of the diagram in Fig. 4 in detail in the following
sections.
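
The relationship between frame types and encoding modes stated above
can be written down as a small lookup table. The sketch below is only
a restatement of that paragraph in Python; the names `FRAME_MODES`,
`allowed_modes`, and `may_encode_keypoints` are hypothetical and do
not come from the paper.

```python
# Encoding modes permitted for each frame type, as listed in the text.
FRAME_MODES = {
    "D": ("Intra",),                  # Detection frame: Intra only
    "S": ("Skip",),                   # Skip frame: Skip only
    "U": ("Intra", "Skip", "Inter"),  # Update frame: all three modes
    "N": (),                          # Null frame: no keypoint encoding
}


def allowed_modes(frame_type):
    """Return the tuple of keypoint encoding modes for a frame type."""
    return FRAME_MODES[frame_type]


def may_encode_keypoints(frame_type):
    """True if the frame type can carry explicitly encoded keypoints,
    i.e. if it permits the Intra or Inter mode. Skip mode estimates
    keypoints without differential encoding."""
    return any(m in ("Intra", "Inter") for m in FRAME_MODES[frame_type])
```

For example, `allowed_modes("U")` returns all three modes, whereas
`may_encode_keypoints("S")` is false because S-frames rely on
estimation alone.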

IV. Keypoint prediction 

When the camera or the object in a video is moving, the locations,
scales, and orientations of the detected keypoints also change
gradually. Fig. 5 shows the keypoints detected in frame 1 and frame
20 of the video sequence Barry White. We can see that the keypoints
are still closely related. The red squares indicate a pair of related
keypoints. However, the location, scale, and orientation of the
keypoint have changed; thus, we must predict the keypoints for the
current frame using the previous keypoints.


Fig. 5. Keypoints from video frame 1 and video frame 20 in the example
video Barry White.

A. Feature matching and affine transformation

As shown in Fig. 4, we determine the frame types for keypoint
encoding by matching the features of the current frame to the
features of the previous frame. To simplify the discussion, let us
first consider only D-frames and S-frames to describe our proposed
framework. As introduced in the previous section, the keypoints in
D-frames are detected without reference to previous frames (Intra
mode). We estimate the keypoints for the subsequent S-frames using
all keypoints from the previous frame. The keypoint estimation
process for S-frames is represented by Skip in Fig. 4. The number of
frames between consecutive D-frames is the detection interval Δ,
which is set to a fixed number (e.g., 5 or 20) in our first
experiment.

Two video frames have a relationship that can be locally described by
a geometric transformation, e.g., an affine transformation or a
perspective transformation. In our experiments, we assume that the
relationship between two consecutive frames can be described by an
affine transformation and transmit the corresponding affine transform
matrix. The locations, scales, and orientations of the keypoints in
the current frame can then be estimated using this transform and the
keypoints from the previous frame. The features from the previous
frame are stored in the Previous features buffer. The block Feature
matching is used to calculate the affine transform matrix. In the
literature, several affine transform matrices were determined in [25]
for subregions of a video frame with the goal of improving the
conventional rate-distortion performance in video coding. We can also
use several affine transform matrices in the case that the video
contains multiple objects of interest that are moving independently.
In the Stanford MAR dataset, all of the test videos contain objects
of interest that are moving in the same direction. Thus, similar to
Q, we use only one common transform matrix in this paper. However,
multiple matrices can be used for complex video scenes to improve
performance. Next, we will explain the blocks labeled Affine
transform matrix T and Keypoint estimation in Fig. 4 in detail.

B. Location, scale, and orientation estimation

After obtaining the affine transform matrix, we estimate the
keypoints in the current frame using the keypoints from the previous
frame. The estimated location can be easily calculated as follows:

$$ \begin{bmatrix} \hat{x} \\ \hat{y} \\ 1 \end{bmatrix}
 = T \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
 = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}
   \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ 0 & 0 & 1 \end{bmatrix}
   \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{5} $$

where $(x, y)$ is the keypoint location in frame $f-1$ and
$(\hat{x}, \hat{y})$ is the estimated location in frame $f$. (5)
represents the affine transformation in homogeneous coordinates.
Here, $T$ is the affine transform matrix, and we first decompose the
transform matrix into two component matrices as shown in (5). The
second matrix can be further decomposed as follows:

$$ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
 = \begin{bmatrix} r_1 & 0 \\ 0 & r_2 \end{bmatrix}
   \begin{bmatrix} 1 & 0 \\ q & 1 \end{bmatrix}
   \begin{bmatrix} \cos(\phi) & \sin(\phi) \\ -\sin(\phi) & \cos(\phi) \end{bmatrix} \tag{6} $$

where the matrix on the left represents the scaling, the middle
matrix represents the shearing, and the matrix on the right
represents the rotation. Note that this decomposition is not unique.
Here, $r_1$ and $r_2$ are the scaling factors in two directions, $q$
is the shearing factor, and $\phi$ is the clockwise rotation angle.
By solving the four equations represented by (6), we obtain the
following:

$$ r_1 = \sqrt{a^2 + b^2}, \quad
   r_2 = \frac{ad - bc}{\sqrt{a^2 + b^2}}, \quad
   q = \frac{ac + bd}{ad - bc}, \quad
   \phi = \operatorname{atan2}(b, a) \tag{7} $$
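
The closed-form solution in (7) can be checked numerically with a
round trip through the factorization in (6). The sketch below is
illustrative only: the function names `compose_affine` and
`decompose_affine` are hypothetical, and it assumes $r_1 > 0$ and
$ad - bc \neq 0$ (no degenerate or singular transform).

```python
import math


def compose_affine(r1, r2, q, phi):
    """Multiply scaling * shearing * rotation as in (6); return a, b, c, d."""
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    a = r1 * cos_p
    b = r1 * sin_p
    c = r2 * (q * cos_p - sin_p)
    d = r2 * (q * sin_p + cos_p)
    return a, b, c, d


def decompose_affine(a, b, c, d):
    """Recover (r1, r2, q, phi) from the 2x2 matrix [[a, b], [c, d]] via (7)."""
    det = a * d - b * c
    r1 = math.hypot(a, b)
    if r1 == 0.0 or det == 0.0:
        raise ValueError("degenerate matrix: decomposition undefined")
    r2 = det / r1
    q = (a * c + b * d) / det
    phi = math.atan2(b, a)
    return r1, r2, q, phi
```

Composing a matrix from, say, $r_1 = 1.05$, $r_2 = 0.95$, $q = 0.02$,
$\phi = 0.1$ and decomposing it again returns the same four parameters
up to floating-point rounding.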



Fig. 6. Scaling, shearing, and rotation transformations derived from matrix 
A. 

These formulae are graphically illustrated in Fig. 6. After the
affine transformation, the circular keypoint in the first frame has
taken on an elliptic shape in the second frame. Its area is
determined by the scaling factors $r_1$ and $r_2$ and the shearing
factor $q$ in (6). From Fig. 6, we can see that the shearing and
rotation transformations do not change the area, whereas $r_1$ and
$r_2$ do affect the area. Because the keypoint of a SIFT feature has
a circular shape, we assume that the elliptical area of an estimated
keypoint should be the same as the area of the corresponding keypoint
(the dashed blue circle with radius $r$) in the current frame that
would be detected using the conventional detection process. From the
formulae for calculating the areas of a circle and an ellipse, the
following scaling factor is obtained:

$$ \pi r^2 = \pi r_1 r_2, \qquad s = r = \sqrt{r_1 r_2} \tag{8} $$
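
The claim that only $r_1$ and $r_2$ change the area follows from the
determinants of the three factors in (6). The short derivation below
is a sketch that assumes, for illustration, that the keypoint region
in the previous frame is a unit circle and that $r_1 r_2 > 0$.

```latex
\begin{align*}
\det A &= \det\begin{bmatrix} r_1 & 0 \\ 0 & r_2 \end{bmatrix}
          \det\begin{bmatrix} 1 & 0 \\ q & 1 \end{bmatrix}
          \det\begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix}
        = (r_1 r_2)\cdot 1 \cdot 1 = r_1 r_2, \\
\text{area of the ellipse} &= |\det A| \cdot \pi \cdot 1^2 = \pi r_1 r_2, \\
\pi r^2 = \pi r_1 r_2 \;&\Longrightarrow\; r = \sqrt{r_1 r_2} = \sqrt{ad - bc}.
\end{align*}
```

The last equality uses $r_1 r_2 = ad - bc$ from (7), so the scaling
factor can be computed from the matrix entries without performing the
full decomposition.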



To summarize, the location $\hat{l}^f$ (a vector of $\hat{x}$ and
$\hat{y}$) of a possible keypoint in the current frame $f$ can be
estimated by applying $T$ in (5) to the location $l^{f-1}$ (a vector
of $x$ and $y$) of a keypoint in the previous frame $f-1$. This
operation is expressed as follows:

$$ \begin{bmatrix} \hat{l}^f \\ 1 \end{bmatrix}
 = T \begin{bmatrix} l^{f-1} \\ 1 \end{bmatrix} \tag{9} $$

The scale of the keypoint is estimated by multiplying the scale of
the original keypoint, $\sigma^{f-1}$, by the scaling factor $s$ as
follows:

$$ \hat{\sigma}^f = s \cdot \sigma^{f-1}
 = \sqrt{r_1 r_2} \cdot \sigma^{f-1} \tag{10} $$

Furthermore, the orientation is estimated by rotating the orientation
as follows:

$$ \hat{\theta}^f = \theta^{f-1} - \phi \tag{11} $$

Thus, the block labeled Keypoint estimation in Fig. 4 can be detailed
as shown in Fig. 7.
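
Equations (9) to (11) can be combined into one small routine. The
sketch below is an illustration under stated assumptions, not the
authors' implementation: the function name `estimate_keypoint` is
hypothetical, $T$ is passed as its first two rows
`[[a, b, tx], [c, d, ty]]`, the scaling factor is computed as
$\sqrt{ad - bc}$ (equal to $\sqrt{r_1 r_2}$ by (7)), and the
orientation is taken as $\theta - \phi$, which assumes the angle is
measured in the same coordinate system as the keypoint locations.

```python
import math


def estimate_keypoint(x, y, sigma, theta, T):
    """Estimate a keypoint in the current frame from the previous frame.

    x, y   -- keypoint location in the previous frame
    sigma  -- keypoint scale in the previous frame
    theta  -- keypoint orientation (radians) in the previous frame
    T      -- first two rows of the affine matrix: [[a, b, tx], [c, d, ty]]
    Returns (x_hat, y_hat, sigma_hat, theta_hat).
    """
    (a, b, tx), (c, d, ty) = T[0], T[1]
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("expected an orientation-preserving transform")
    # (9): apply the affine transform to the location.
    x_hat = a * x + b * y + tx
    y_hat = c * x + d * y + ty
    # (10): equal-area scaling factor s = sqrt(r1 * r2) = sqrt(ad - bc).
    sigma_hat = math.sqrt(det) * sigma
    # (11): rotate the orientation by the angle phi = atan2(b, a).
    theta_hat = theta - math.atan2(b, a)
    return x_hat, y_hat, sigma_hat, theta_hat
```

For a pure zoom by a factor of two with translation (3, 4), the
keypoint at (1, 1) moves to (5, 6), its scale doubles, and its
orientation is unchanged.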



Fig. 7. The location, scale and orientation estimation. 


We perform an initial experiment to justify our method of estimating
the location, scale, and orientation of a keypoint. In this
experiment, we use the unquantized keypoints from the D-frames and
the unquantized affine transform matrix T. The descriptors are
calculated from the uncompressed video frames using the estimated
keypoints in the S-frames to demonstrate the accuracy of the
estimation. As described in Section II-A, eight videos, each
containing a single static object, are used; the SIFT descriptors
from the video frames are compared with those extracted from the
corresponding reference images, and the average numbers of matching
features are recorded.


TABLE I
Evaluation of location, scale, and orientation estimation for keypoints

Adaptation                             Avg. # matches
                                       Δ=5       Δ=20
loc. only, (9)                         58.33     49.20
loc. + sc., (9) + (10)                 58.55     50.98
loc. + orient., (9) + (11)             58.96     57.64
loc. + sc. + orient., (9) to (11)      59.19     59.13
Independent detection                       59.13

We perform this experiment to verify the performance of the blocks
labeled Affine transform, Scale adaptation, and Orientation
adaptation in the diagram shown in Fig. 7. In Table I, loc. only
refers to the case in which only the keypoint locations are estimated
using (9), whereas the scales and orientations are simply estimated
as the scales and orientations of the previous keypoints, i.e.,
$\hat{\sigma}^f = \sigma^{f-1}$ and $\hat{\theta}^f = \theta^{f-1}$.
The notation loc. + sc. refers to the case in which the scales are
also modified using a scaling factor, as in (10). The notation
loc. + sc. + orient. indicates the case in which the locations,
scales, and orientations are all appropriately transformed. In
addition, the results of conventional keypoint detection and
descriptor calculation for each separate frame are provided for
comparison (Independent detection). In this case, all frames are
D-frames, and the keypoints are detected independently for each
frame. Table I shows that the matching performance is improved by
applying (10) and (11), indicating that our estimation method is
effective. Note that for a detection interval of Δ=5, loc. only also
achieves a high number of matches, and the improvement achieved for
loc. + sc. + orient. seems quite small because of the small detection
interval and because the test videos each contain only a single
object. However, when the detection interval is large (Δ=20) or the
video content is rapidly changing, loc. only and loc. + sc. can
easily fail to produce correct descriptors. Accordingly, we can see
that loc. + sc. + orient. achieves much better performance in the
third column of Table I. In addition, two possible reasons that the
keypoints estimated using (9) to (11) slightly outperform the
independently detected keypoints are that the RANSAC computation
yields inconsistent numbers of matching features and that keypoints
may disappear or reappear in consecutive frames as a result of the
feature detection and feature selection processes. The estimated
keypoints, which are absent in conventional feature detection, are
still sufficient to extract effective descriptors. As a result, the
average number of matching features obtained from the estimated
keypoints could be slightly higher than that obtained using the
independent detection method. In the following experiments, we use
the estimated keypoints (location, scale, and orientation) to extract
descriptors.

C. Quantization of the affine transform matrix 

The affine transform matrix contains real-valued numbers;
therefore, we must perform lossy encoding for this matrix. The
affine transform parameters are quantized via scalar
quantization in [25]. The authors of 0 encode the affine
transform matrix T using differential keypoint location coding.
They uniformly quantize the parameters a, b, c, d, tx, and ty
in Q using 7 bits each, resulting in 42 bits in total for the
matrix T. By contrast, we propose to quantize the parameters
r1, r2, q, and φ in (0. The associated keypoints in consecutive
frames have similar scales and small differences in
orientation. We find that the parameters r1 and r2 lie within
the range [0.9, 1.1], the parameter q remains within the range
[-0.05, 0.05], and φ is within the range [-0.15, 0.15]. This is
because the affine transform matrix T is calculated between two
consecutive frames. To fairly compare our method with the
quantization method proposed in Q, we also assign 42 bits to
the matrix T. For r1 and r2, we use 6-bit quantizers, and for
q, we assign 7 bits. We observed in our experiment that the
quantization of φ strongly affects the matching performance;
therefore, 9 bits are assigned to this parameter. For tx and
ty, we assign the same number of bits as in 0 (namely, 7) to
enable the exclusive comparison of the quantization methods for
the matrix A in 0. Subsequently, to improve the matching
performance, we increase the allotted space for encoding the
matrix T to 48 bits, i.e., 7, 7, 7, 9, 9, and 9 bits for r1,
r2, q, φ, tx, and ty, respectively, using uniform quantization.
The quantized T can be obtained from the quantized parameters
in (|^. We use a detection interval of Δ=5 in our experiments
for illustration, and the results are shown in Table II. As can
be seen, we achieve slightly better results than the method in
0. Our result for 48 bits approaches the performance achieved
using the unquantized T. In the following experiments, we use
the quantized T in place of the uncompressed T, using 48 bits
to encode the affine transform matrix.
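
The bit allocation described above can be illustrated with a minimal sketch of a uniform scalar quantizer. The ranges for r1, r2, q, and φ and the two bit budgets are taken from the text. The translation range `T_RANGE`, the class and function names, and the choice to place reconstruction levels on both range endpoints are assumptions made only for illustration; they are not the implementation used in the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniformQuantizer:
    """Uniform scalar quantizer over [lo, hi] with 2**bits levels."""
    lo: float
    hi: float
    bits: int

    @property
    def levels(self) -> int:
        return 1 << self.bits

    @property
    def step(self) -> float:
        # Assumption: levels are placed on both endpoints of the range.
        return (self.hi - self.lo) / (self.levels - 1)

    def encode(self, value: float) -> int:
        clamped = min(max(value, self.lo), self.hi)  # clip out-of-range input
        return round((clamped - self.lo) / self.step)

    def decode(self, index: int) -> float:
        return self.lo + index * self.step


# Hypothetical translation range in pixels (not stated in the text).
T_RANGE = (-64.0, 64.0)

# 48-bit allocation: 7, 7, 7, 9, 9, 9 bits for r1, r2, q, phi, tx, ty.
QUANTIZERS_48 = {
    "r1": UniformQuantizer(0.9, 1.1, 7),
    "r2": UniformQuantizer(0.9, 1.1, 7),
    "q": UniformQuantizer(-0.05, 0.05, 7),
    "phi": UniformQuantizer(-0.15, 0.15, 9),
    "tx": UniformQuantizer(*T_RANGE, 9),
    "ty": UniformQuantizer(*T_RANGE, 9),
}

# 42-bit allocation used for the comparison in Table II.
BITS_42 = {"r1": 6, "r2": 6, "q": 7, "phi": 9, "tx": 7, "ty": 7}


def encode_params(params: dict) -> dict:
    """Map each real-valued parameter to its quantization index."""
    return {name: QUANTIZERS_48[name].encode(v) for name, v in params.items()}


def decode_params(indices: dict) -> dict:
    """Reconstruct the quantized parameter values from their indices."""
    return {name: QUANTIZERS_48[name].decode(i) for name, i in indices.items()}
```

With this layout, the reconstruction error of each in-range parameter is bounded by half of the corresponding step size.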

D. Adaptive detection interval 

In the previous section, we presented experiments performed
using a fixed detection interval Δ. However, a fixed
detection interval will not be adequate when objects are


TABLE II
Comparison of different quantization methods for the affine
transform matrix T

Quantization               Avg. # matches
Uncompressed T             59.19 (Table I, Δ=5)
Method of js] (42 bits)    59.01
Our method (42 bits)       59.06
Our method (48 bits)       59.13


entering or leaving a scene because the descriptors extracted
from some of the estimated keypoints will become spurious.
Additionally, when the object of interest does not change
across a large number of frames, a fixed detection interval
may result in a new set of conventionally detected keypoints
being sent sooner than is necessary. To address these issues,
we propose to use an adaptive detection interval. Specifically,
we insert a D-frame or a U-frame when the keypoints
from the previous D- or U-frame are insufficient for feature
extraction in the current frame.

1) Adding D-frames or U-frames adaptively: The proposed 
process for determining the frame type is as follows. For a 
new frame, we first extract the descriptors for the estimated
keypoints and compare them with the features from the 
previous D- or U-frame. 

• If the affine transform matrix cannot be calculated or is 
incorrect, this indicates that a new scene has begun because 
the estimated keypoints cannot produce correct descriptors. 
Therefore, we specify a D-frame for keypoint encoding. 

• If the number of matches is greater than a certain threshold 
(e.g., e = 80%) with respect to the number of features in the 
previous D- or U-frame, then most of the estimated keypoints 
can be considered effective for the current frame. Therefore,
an S-frame is specified. Note that the threshold e significantly
affects the determination of S-frames and the bitrate of the
side information. For the test dataset, the selected value is 
reasonable; however, it can be adjusted for other datasets or 
applications. 

• If several of the keypoints are still valid, although the 
object of interest is leaving the current scene, another object is 
entering the scene, or many keypoints deviate from the actual 
keypoints after a large number of frames, then we designate 
the current frame as a U-frame. 

The reason that we compare the features of the current 
frame with the features of the previous D- or U-frame is 
that the initial keypoints propagated to the intervening frames 
originate from this previous D- or U-frame. After determining 
the frame type, we calculate the affine transform matrix T 
between the previous frame and the current frame if the frame 
is designated as an S- or U-frame. Note that we do not use 
the affine transform matrix between the current frame and the 
previous D- or U-frame because the compression of this matrix 
requires a higher bitrate. An individual T for each S- or U- 
frame is then quantized and transmitted as side information. 
Transmitting T is not required for D-frames. 
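
The three-way decision described in this subsection can be summarized by the following minimal sketch. The function and argument names are hypothetical, and the check that decides whether the affine transform matrix is "incorrect" is abstracted into a single boolean because its exact criterion is not specified here.

```python
def decide_frame_type(affine_ok: bool, num_matches: int,
                      num_ref_features: int, e: float = 0.8) -> str:
    """Return 'D', 'S', or 'U' for the current frame.

    affine_ok        -- False if T cannot be calculated or is incorrect
    num_matches      -- matches between the current frame and the
                        previous D- or U-frame
    num_ref_features -- number of features in that D- or U-frame
    e                -- match-ratio threshold (80% in the text)
    """
    if not affine_ok or num_ref_features == 0:
        return "D"  # new scene: detect keypoints conventionally
    if num_matches / num_ref_features > e:
        return "S"  # most estimated keypoints are still effective
    return "U"      # some keypoints are valid, the rest need an update
```

For example, with 100 features in the previous D-frame, 85 matches give an S-frame, whereas 60 matches give a U-frame.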

2) Comparison of adaptive and fixed detection intervals:
In this section, we compare the adaptive-interval and fixed-
interval schemes. In this experiment, the previous settings are
used: a fixed Δ = 5, quantization of the affine transform matrix
using 48 bits, and conventional feature detection for D- or
U-frames. Note that we still use the original (unquantized) 
keypoints for the D- and U-frames because the objective 
here is to evaluate the performance of our adaptive detection 
interval scheme for uncompressed video frames. In addition, 
we add an independent detection scheme to detect features for 
each frame individually in a conventional manner. From Fig. 8,
we can see that the performance of the fixed-interval scheme
is similar to that of the independent detection scheme because
the interval Δ is quite short. The green curve represents
the performance of the adaptive-interval scheme, and the 
red dots represent the D- and U-frames. Interestingly, the 
keypoints from the D-frames strongly affect the descriptor 
calculation and matching in the consecutive S-frames. As a 
result, the adaptive-interval scheme sometimes outperforms 
the independent detection scheme (e.g., at frame 70), but it 
is inferior, for example, at frame 80. 


Fig. 8. Matching performance comparison among independent detection, a
fixed detection interval (Δ = 5), and an adaptive detection interval (example
video: Monsters Inc.). The red dots indicate the frames at which conventional
detection is applied in the adaptive-interval scheme.


E. Switching off keypoint encoding and transmission 

When the camera or one or more objects are moving quickly,
the number of features matched in the next frame could
fall below the threshold e = 80%. In this case, we need to
designate a U-frame. Thus, the bitrate of the keypoints will
be significantly increased by the necessity of specifying many
U-frames for rapidly moving scenes. This situation can be
mitigated by reducing the threshold e; however, because this
threshold is used to indicate the percentage of well-estimated
keypoints, its value should not be set too low. To reduce the
bitrate of the side information, as shown in Fig. we add
N-frames for quick scene changes. This is motivated by the
observation that the video frames for a fast-moving scene are
less relevant to the computer vision algorithm run at the server.

The method for determining an N-frame is as follows. Once
we have a D- or U-frame, we assume that a new scene or
an updated scene is coming. However, if the scene is too
short, we may need to add many D- or U-frames, which
would introduce a significant increase in bitrate. Therefore,
we need to check the length of the scene corresponding to
the current D- or U-frame. If the subsequent N_s frames are
not all S-frames, then the current D- or U-frame is changed to
an N-frame. This indicates that the scene corresponding to
the current frame does not remain stable across N_s + 1 frames.
We skip the keypoint encoding for this frame, and the next
frame is assumed to be a new D-frame. This process is then
performed again for the new D-frame. We switch off the
keypoint encoding and transmission for such N-frames, and at
the server, the features for these frames are directly extracted
from the decoded frames. We will discuss the N-frames
selected in a retrieval experiment using Multiple Objects video
sequences in Section VI-B.
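
The N-frame rule can be sketched as a pass over provisional frame-type labels. This is a simplified illustration under stated assumptions: the function name is hypothetical, the provisional labels of later frames are treated as fixed (in the actual scheme they would be re-evaluated against the new D-frame), and a D- or U-frame with fewer than N_s following frames is left unchanged.

```python
def mark_n_frames(provisional: list[str], n_s: int) -> list[str]:
    """Relabel short-lived D-/U-frames as N-frames.

    provisional -- provisional frame types ('D', 'S', 'U') in display order
    n_s         -- number of following frames that must all be S-frames
    """
    out = list(provisional)
    force_d = False
    for i in range(len(out)):
        if force_d:
            out[i] = "D"  # the frame after an N-frame starts a new scene
            force_d = False
        if out[i] in ("D", "U"):
            window = provisional[i + 1:i + 1 + n_s]
            # Scene is unstable if any of the next n_s frames is not an S-frame.
            if len(window) == n_s and any(t != "S" for t in window):
                out[i] = "N"
                force_d = True
    return out
```

For the provisional sequence D, S, U, S, S, S, S with N_s = 4, the first two frames become N-frames and the third becomes the D-frame that anchors the following S-frames.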


Table III shows the average number of matches and the
average number of D- or U-frames over all videos. The fixed-
interval scheme achieves similar matching performance to that
of the independent detection scheme. In the dataset we are
using, each video contains only one object and no scene changes.
Therefore, the performance of the fixed-interval scheme does
not suffer. The performance of the adaptive-interval scheme is
somewhat reduced because of the existence of more S-frames;
however, the number of D- and U-frames is significantly
reduced. This will result in a significant reduction in the
amount of the side information, as shown in Section V-C.
Moreover, because of the manner in which we insert D- or U-
frames, our adaptive-detection-interval scheme is still suitable
for a rapidly changing scene. We will discuss experiments
using videos containing scene changes in Section VI-B.


TABLE III
The average numbers of matches and of D- and U-frames over
all videos.

Scheme                      Avg. # matches      Avg. # D/U-frames
Indep. detect.              59.13 (Table I)     100
Fixed interval (Δ=5)        59.13 (Table II)    20
Adap. interval (e = 80%)    57.95               4.63



V. Keypoint encoding and transmission 

In the previous sections, we presented several experiments
to justify the feasibility of using predicted keypoints,
quantizing the affine transform matrix, and using an adaptive
detection interval. In addition, we proposed switching off
the keypoint encoding and transmission when the scene is
moving quickly to reduce the number of D- or U-frames.
As a result, the number of keypoints that must be encoded
is significantly reduced. Next, we will describe the keypoint
encoding approaches used for the different types of frames.
First, we use 2 bits to indicate the frame type for each frame.
For D-frames, similar to the approach used in our previous
study of keypoint encoding for still images, we quantize
the keypoint locations into integer values and use sum-based
context-based arithmetic coding to encode them. We use 12
bits for scale and orientation encoding. In our experiments,
this keypoint encoding method is referred to as the Intra mode.
For S-frames, we use the estimated keypoints directly and send
only the quantized affine transform matrix, which requires 48
bits. This procedure is denoted by Skip mode in Fig. For
U-frames, we use three modes to encode the keypoints: the
Intra, Skip and Inter modes. The Inter mode is identical to
the differential keypoint encoding mode shown in Fig. If
this mode is selected, then the differences between the scales,
locations, and orientations of a matched pair are encoded
and transmitted. Note that for an N-frame, the encoding and
transmission of the keypoints and the affine transform matrix
T are skipped.

A. Intra mode keypoint encoding 

1) Keypoint quantization: Because the locations, scales, 
and orientations of the keypoints are represented using 
floating-point numbers, they must be quantized for efficient 
transmission. We apply different quantization methods for 
each. 

For the locations, one previously proposed method uses
a spatial grid laid over the original image with step sizes of 4,
6, and 8 pixels. Thus, the locations are quantized by a factor of
4, 6, and 8, respectively. In another study JZT) , the locations
were quantized into integer values, meaning that they were
quantized by a factor of 1. In general, locations are quantized
as follows:

    \hat{k} = f \cdot \mathrm{round}(k / f)    (12)

where f is the quantization factor. In our experiments, we
determine which quantization factor should be used such that
the quantization of the locations does not significantly affect
the number of feature matches. Based on the literature m,
we set f to 1 for location quantization.
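
As a small worked example of (12), the following sketch quantizes a keypoint location with a given factor f. The function name is hypothetical. Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule used in the original implementation.

```python
def quantize_location(x: float, y: float, f: int = 1) -> tuple[int, int]:
    """Quantize a keypoint location following (12): f * round(k / f)."""
    return f * round(x / f), f * round(y / f)
```

With f = 1, the location (10.4, 7.6) is mapped to (10, 8); with f = 4, the location (13.0, 9.0) is mapped to the grid point (12, 8).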

In SIFT, the scale σ_i can be represented as follows:

    \sigma_i = \sigma_0 \, 2^{(o + s/3)} + \Delta\sigma    (13)

where σ_0 is a base scale offset (i.e., 2.0159), o is an octave
ranging from 0 to a number x that depends on the size of the
image (x is less than 4 for our dataset), and s is an integral
scale in the range [0, 2]. Thus, 3 and 2 bits are sufficient to
represent o and s, respectively. Δσ is an offset calculated to
increase the accuracy of the scale estimates in SIFT keypoint
detection. In this process, a quadratic polynomial is fit to the
values of the detected scale-space extremum (σ_0 2^(o+s/3)) to
localize more accurate scales with a resolution that is higher
than the scale sampling density. Thus, the value Δσ is related
to the scale-space extremum (σ_0 2^(o+s/3)). We calculate the
difference between the detected scale σ_i and its corresponding
scale-space extremum (σ_0 2^(o+s/3)). We then normalize the
difference as follows:

    \Delta\sigma_n = (\sigma_i - \sigma_0 2^{(o + s/3)}) / (\sigma_0 2^{(o + s/3)})    (14)

Following our previous study [19], we encode the normalized
difference using the Lloyd-Max quantization algorithm. We
assign 1 bit to Δσ_n. Thus, the scales, including o, s, and
Δσ_n, are assigned 6 bits in total.
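
The decomposition in (13) and (14) can be illustrated by the following sketch, which splits a detected scale into the octave o, the integral scale s, and the normalized offset Δσ_n. The function name is hypothetical, and recovering (o, s) by taking the nearest grid scale is an assumption made for illustration; a detector would normally report o and s directly. The 1-bit Lloyd-Max quantization of Δσ_n is not shown because it requires a trained codebook.

```python
import math

SIGMA0 = 2.0159           # base scale offset from the text
LEVELS_PER_OCTAVE = 3     # s takes values in {0, 1, 2}


def decompose_scale(sigma: float) -> tuple[int, int, float]:
    """Return (o, s, normalized offset) for a detected scale.

    Assumption: the scale-space extremum is the grid scale
    SIGMA0 * 2**(o + s/3) nearest to sigma in the log domain.
    """
    idx = round(LEVELS_PER_OCTAVE * math.log2(sigma / SIGMA0))
    idx = max(idx, 0)  # octaves start at 0 in the text
    o, s = divmod(idx, LEVELS_PER_OCTAVE)
    grid = SIGMA0 * 2 ** (o + s / LEVELS_PER_OCTAVE)
    return o, s, (sigma - grid) / grid
```

A scale that lies 2% above the grid scale for o = 1 and s = 2 is decomposed into (1, 2, 0.02) up to floating-point precision.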

Similar to the location quantization, the orientations are
quantized as follows:

    E(\theta_i) = \mathrm{round}\left( \left( \frac{\theta_i}{2\pi} + 0.75 \right) \times (2^t - 1) \right)

    \hat{\theta}_i = \left( \frac{E(\theta_i)}{2^t - 1} - 0.75 \right) \cdot 2\pi    (15)

where 0.75 is an offset that ensures that the values
(θ_i/(2π) + 0.75) lie on the interval [0, 1) in the VLFeat SIFT
implementation. Then, the index E(θ_i) can be represented by
values on the interval [0, 2^t) and can be encoded via
fixed-length coding with t bits. In our experiment, t is set to
6, thereby quantizing the orientations into 64 levels.

With this approach, the quantized keypoints yield feature-
matching performance similar to that of the ideal,
uncompressed keypoints. In total, the encoding of the scale and
orientation for one keypoint requires 12 bits.
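
The orientation quantization can be sketched as follows, assuming that the index is formed by adding the 0.75 offset to θ/(2π), scaling by 2^t - 1, and rounding, and that decoding inverts these steps. The function names are hypothetical, and the input orientation is assumed to satisfy the stated condition that θ/(2π) + 0.75 lies in [0, 1).

```python
import math


def encode_orientation(theta: float, t: int = 6) -> int:
    """Quantize an orientation (radians) to a t-bit index."""
    return round((theta / (2 * math.pi) + 0.75) * (2 ** t - 1))


def decode_orientation(index: int, t: int = 6) -> float:
    """Reconstruct the orientation (radians) from its t-bit index."""
    return (index / (2 ** t - 1) - 0.75) * 2 * math.pi


# Bit budget per keypoint for scale and orientation, as stated in the text.
SCALE_BITS = 3 + 2 + 1       # octave, integral scale, normalized offset
ORIENTATION_BITS = 6
assert SCALE_BITS + ORIENTATION_BITS == 12
```

With t = 6, the reconstruction error is at most half a quantization step, i.e., π/63 radians.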

2) Context model for location coding: As explained in
the previous section, we quantize the original locations into
integer values (quantization factor 1). To encode the quantized
locations, we use the same approach used in CDVS Q, i.e.,
sum-based context-based arithmetic coding. The number of
bits required for location encoding depends on the number
of keypoints to be encoded and their distribution. In sum-
based context-based arithmetic coding, we must first train the
context model. Similar to the approaches used in a previous
study 1^ and in CDVS Q, we use the INRIA Holidays
Dataset¹ and the Caltech Building Dataset² for training. Note
that the joint dataset comprises 1741 images. Because the
location quantization factor is set to 1, the block width is also
correspondingly set to 1. Based on the results of our previous
study we select a context range of 49, and the quantized
locations are encoded using the corresponding trained model.

3) Keypoint encoding: We encode the keypoints using the 
previously described levels of quantization for their locations, 
scales, and orientations. If all encoded keypoints are sent 
as side information, then the extracted descriptors can be 
expressed as follows; 

di = T'(fci|/), with ki = Dec{Enc{ki)) (16) 

where Enc(·) and Dec(·) denote the encoding and decoding,
respectively, of the keypoints.

B. Keypoint encoding for U-frames 

For U-frames, we use three modes to encode the keypoints; 
the Intra, Skip and Inter modes. The Inter mode is the 
differential keypoint encoding mode shown in Fig. The 
differences between the scales, locations and orientations of 
a matched pair are encoded and transmitted. These steps are 
described as follows. 


Step 1. Find the matches between two consecutive 
frames using the NNDR matching strategy. We denote 
the matched keypoints in the previous frame by = 
{ki, ...,kf, ...,k‘f} and the matched keypoints in the 
current frame by = {k’l,k \,..., fc^}. 

Step 2. Use the quantized affine transform matrix 
to estimate the locations (0, scales ( [T0| , and orienta¬ 
tions 0 of the keypoints in the current frame from 
the keypoints in the previous frame. We denote the 
keypoint estimation by = {fcj,..., ..., fc^}. 


¹ http://lear.inrialpes.fr/people/jegou/data.php
² http://vision.caltech.edu/malaa/datasets/caltech-buildings/
Step 3. Calculate the differences to obtain Ak = — 

Aai = CTj — (7^ and A0i = 9^ — 0\ for each pair {k\, k\} 
of {K'^.K'"}. 

Step 4. Remove incorrect matches if the differences
are too large, i.e., Δl_i > 16, abs(Δσ_i)/σ_i > 0.3,
or abs(Q(Δθ_i)) > 4, where Q(·) denotes the index
difference derived from (15). As a result, correct matches
have the following properties: Δl_i lies on the interval
[-16, 16]; the index of Δσ_i/σ_i lies on the interval [0, 5],
where it is quantized into five levels; and Q(Δθ_i) lies on
the interval [-4, 4].

Step 5: a) The Skip mode is used if the matched
keypoints satisfy the following three conditions: 1) Δl_i <=
1; 2) the index of Δσ_i/σ_i is equal to 2, which means that
the scale has not changed; and 3) Q(Δθ_i) = 0.

b) The Inter mode is used for the differential coding 
of any other correctly matched keypoints. The differential 
values are encoded via arithmetic coding. 

c) The Intra mode is employed for non-matched fea¬ 
tures and incorrectly matched features, which are treated 
as features corresponding to new scene content. These 
keypoints are encoded in the same manner as the key- 
points in D-frames. 


Note that the bitrate for keypoint encoding is determined by 
the threshold e that is used to determine the frame types and 
by the quantization factors that are used for each quantity in 
the Intra and Inter modes. 
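
Steps 4 and 5 can be condensed into a per-keypoint mode decision, sketched below. The thresholds (16 pixels, 0.3, and an orientation index difference of 4) come from the steps above. The function names, the use of the larger absolute coordinate difference as the location difference, and the uniform five-level mapping of the relative scale difference over [-0.3, 0.3] are assumptions for illustration only.

```python
def scale_index(rel_scale: float, limit: float = 0.3, levels: int = 5) -> int:
    """Map a relative scale difference in [-limit, limit] to 0..levels-1.

    Assumption: uniform bins, so the middle index (2) contains zero.
    """
    step = 2 * limit / levels
    idx = int((rel_scale + limit) / step)
    return min(max(idx, 0), levels - 1)


def select_mode(dx: float, dy: float, rel_scale: float, dtheta_idx: int) -> str:
    """Choose 'Skip', 'Inter', or 'Intra' for one matched keypoint pair.

    dx, dy      -- location differences in pixels
    rel_scale   -- scale difference divided by the scale
    dtheta_idx  -- difference of quantized orientation indices
    """
    dl = max(abs(dx), abs(dy))
    # Step 4: differences that are too large indicate an incorrect match,
    # which is then encoded like a new keypoint (Intra mode).
    if dl > 16 or abs(rel_scale) > 0.3 or abs(dtheta_idx) > 4:
        return "Intra"
    # Step 5a: nearly unchanged keypoints are skipped.
    if dl <= 1 and scale_index(rel_scale) == 2 and dtheta_idx == 0:
        return "Skip"
    # Step 5b: remaining correct matches are coded differentially.
    return "Inter"
```

Unmatched keypoints do not reach this function and are always encoded in the Intra mode.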


C. Bitrate comparison 

We encode the keypoints using three different schemes and
compare the resulting bitrates. First, all frames are treated
as D-frames, which means that all features are independently
extracted using the decoded keypoints. Second, only the first
frame is considered to be a D-frame, with the following frames
being U-frames. Third, S-frames and an adaptive detection
interval are added. We compare these schemes to demonstrate
the potential bitrate reduction offered by the proposed
approach. Fig. 9 shows the results for the three different schemes.
The introduction of S-frames and an adaptive detection interval
leads to a significant bitrate reduction (by a factor of 18
compared with D-frames only), but the matching performance
is not significantly affected.

Fig. 10 shows the number of keypoint bits for each frame 
of the video sequence Wang Book. The adaptive detection 
interval reduces the number of D- or U-frames by exploiting 
the keypoints that remain coherent across frames. For an S- 
frame, we use 48 bits for the affine transform matrix and 2 bits 
to indicate the frame type. For U-frames, many keypoints are 
encoded in the Intra or Inter mode, therefore requiring a large 
bitrate. However, incorrect keypoints or deviated keypoints are 
corrected by adding U-frames. 
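
The per-frame side-information cost can be tallied as in the sketch below. The 2-bit frame type and the 48-bit affine transform matrix are taken from the text, and T is counted for S- and U-frames only, as described in Section IV-D. The function names, the frame rate, and the keypoint bit counts in the usage note are hypothetical.

```python
FRAME_TYPE_BITS = 2   # signals D, S, U, or N
AFFINE_BITS = 48      # quantized affine transform matrix T


def side_info_bits(frame_type: str, keypoint_bits: int = 0) -> int:
    """Side-information bits for one frame.

    keypoint_bits -- bits spent on Intra/Inter-coded keypoints
                     (only relevant for D- and U-frames)
    """
    if frame_type == "N":
        return FRAME_TYPE_BITS
    if frame_type == "S":
        return FRAME_TYPE_BITS + AFFINE_BITS
    if frame_type == "U":
        return FRAME_TYPE_BITS + AFFINE_BITS + keypoint_bits
    if frame_type == "D":
        return FRAME_TYPE_BITS + keypoint_bits
    raise ValueError(f"unknown frame type: {frame_type}")


def side_info_kbps(frames: list[tuple[str, int]], fps: float) -> float:
    """Average side-information bitrate in kbps for a frame sequence."""
    total = sum(side_info_bits(t, b) for t, b in frames)
    return total * fps / len(frames) / 1000.0
```

For instance, a hypothetical four-frame sequence with one D-frame of 3000 keypoint bits, two S-frames, and one N-frame costs 3104 bits in total, which corresponds to 23.28 kbps at an assumed 30 frames per second.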


VI. Experimental results 
In the previous section, we described how to encode the 
keypoints for the different types of frames. Thus far, the 



Fig. 9. Left: matching performance comparison for the D-frames only
scheme, the U-frames added scheme and the S-frames added scheme over
all videos. Right: bitrate of the side information.

Fig. 10. Number of keypoint bits used for each frame (example video: Wang
Book).


encoded keypoints have been used to extract SIFT descriptors 
from the original video frames. Here, we add the encoded 
keypoints as side information for compressed videos that are 
encoded with different QP values. The pairwise matching and 
retrieval performances are compared with those for standard 
H.265/HEVC-encoded videos. 


A. Pairwise matching results for videos with different QP 
values 


First, we perform pairwise matching for the eight videos
presented in Section II-A. The blue line in Fig. 11 shows
the final matching performance of the proposed approach
as a function of the total bitrate used for the compressed
video plus the keypoint side information. The number of
matches for a given bitrate budget is significantly improved.
In general, a 5x bitrate reduction is achieved compared with
conventional H.264/AVC encoding. This result is better than
those for the patch-encoding approaches presented in [14]
(2.5x) and 0 (4x). Note that in the cited studies, the bitrate
reduction was also calculated in comparison with the
performance of H.264/AVC encoding; therefore, the comparison is
quantitatively fair. Unlike these patch-encoding approaches,
we provide standard-compatible videos in addition to the 
locations, scales, and orientations of the keypoints for a 
geometric consistency check. The keypoints of a correctly 
matched image pair should have correlated locations, scales 
and orientations. Thus, the keypoint information is valuable 
for eliminating outliers and increasing precision. 
Fig. 13. The upper plots show the keypoint bits for individual video frames. The lower plots show the pairwise matching performance between the video
frame and the top retrieved image.


Fig. 11. Matching performance comparison for various approaches (the blue
line represents the proposed approach). The bitrate includes both the encoded
keypoints and the H.265/HEVC-encoded videos.



Fig. 12. PAO values for two Multiple Objects video sequences (panels:
Objects 1 and Objects 2).


B. Retrieval results 

In our previous experiments, we compared the number
of preserved features using pairwise matching. Notably, in
content-based image retrieval systems, performance typically
improves with an increasing number of preserved features. To
evaluate the retrieval performance of our proposed scheme,
we use video sequences in the Stanford MAR dataset that
show multiple objects. Each Multiple Objects video consists
of 200 frames and contains three different objects of interest.
The first two Multiple Objects videos are used in our retrieval
experiments. Excluding the fast spatial matching component,
we use a previously proposed image retrieval system [29]. We
use the MIRFLICKR-25000 database and the 23 reference
images from the Stanford MAR dataset as the training dataset.
Similar to HD, we extract up to 300 SIFT descriptors for
each image in the database and train one million visual words
(VWs) from these descriptors. For the test frames, we extract
200 SIFT features and pass them to the retrieval engine.
After obtaining a shortlist of candidate matching images
from the retrieval system, we run RANSAC on the top 100
images in the shortlist to reorder these retrieved images for
improved precision. We run the retrieval for each frame of
the Multiple Objects videos. As noted in a previous study ||^,
this operation is redundant because the retrieval results for
consecutive frames are closely related. However, the objective
of this experiment is to examine the performance of different
approaches in a scenario wherein an object of interest is
leaving or entering the scene. Note that in this experiment, the
threshold e is set to 80% (Section IV-D), N-frames are used,
and N_s (Section IV-E) is set to 4 for these video sequences.
We encode the videos using H.265/HEVC. Note that we use
only a QP value of 46 (Section II-B). Three approaches are
compared, i.e., feature extraction from the uncompressed video
frames, feature extraction from the compressed video frames,
and feature extraction using the encoded keypoints.


1) Precision at One: Similar to 0, we first plot the
Precision at One (PAO). The PAO is defined as the ratio
between the number of correctly retrieved images in the top
position and the total number of frames used for retrieval. Note
that not all 200 frames are used to calculate the PAO. Only
the frames whose locations are tagged in the ground-truth files
are used. Fig. 12 shows the PAO values for the three tested
approaches for the Multiple Objects videos. The proposed
approach based on encoded keypoints yields a significant
improvement in terms of the PAO. The bitrates of the two
encoded video sequences are 37.08 kbps and 23.44 kbps,
and the bitrates for the encoded keypoints are 7.55 kbps and
9.11 kbps, respectively. As seen from the figure, adding the
encoded keypoints as side information significantly improves
the retrieval performance. In our experiment, we find that the
proposed approach offers superior performance compared with
videos encoded using a smaller QP value at the same bitrate.
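
The PAO computation can be sketched as follows. The function name and the dictionary-based data layout are hypothetical; only frames that appear in the ground truth contribute to the ratio, and a frame for which no image is retrieved counts as incorrect.

```python
def precision_at_one(top_results: dict[int, str],
                     ground_truth: dict[int, str]) -> float:
    """Precision at One over the frames tagged in the ground truth.

    top_results  -- frame index -> label of the top retrieved image
    ground_truth -- frame index -> label of the correct reference image
    """
    if not ground_truth:
        raise ValueError("ground truth must contain at least one frame")
    correct = sum(
        1 for frame, label in ground_truth.items()
        if top_results.get(frame) == label
    )
    return correct / len(ground_truth)
```

For example, if four frames are tagged and the top retrieved image is correct for two of them, the PAO is 0.5, regardless of the results for untagged transition frames.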

2) Number of matches in top retrieved images: The upper
plots of Fig. 13 show the frame types and numbers of bits
for individual video frames. The red, green, blue, and black
dots represent the keypoint bits for D-, S-, U-, and N-frames,
respectively. The keypoint bits for D- and U-frames vary
for different video frames, whereas each S-frame requires 50
bits and each N-frame requires 2 bits, thereby resulting in
a large bitrate reduction. The lower plots of Fig. 13 show
the number of matches achieved using the pairwise matching
scheme only when the system retrieves the correct image in
the top position. A value of zero indicates that the correct
image is not retrieved in the top position or that no matching
image is identified for the video frame. In general, the proposed
approach yields an increased number of matches. In the
lower right plot, because of the significant amount of glare
on the Barry White CD cover, many spurious features are
extracted across consecutive video frames. These keypoints
cannot be propagated to consecutive video frames. Therefore,
many frames are selected as N-frames, for which keypoint
encoding and transmission are skipped (e.g., video frames 95
to 125), resulting in impaired matching performance for Barry
White. The results reported in Q show similar performance
impairments. In addition, the glare on the CD cover causes
the selection of a greater number of D- or U-frames (e.g.,
video frames 49 to 90) when encoding the keypoints. There is
a drop between frame 155 and frame 195 compared with the
uncompressed video, which occurs because the frames during
this portion of the video are all of the S-frame type. In Fig.
the number of matches in the current S-frame is closely related
to the number of matches in the previous D- or U-frame. From
the drop observed in the figure, we can see that the same is also
true here. The performance in this portion of the video could
be improved by adding a U-frame (i.e., tuning the parameters).

The top detected images for the three approaches differ for
certain video frames. For example, as shown in the lower
left plot in Fig. 13, the proposed approach detects Wang
Book correctly from frame 130 to frame 137, whereas the
other two approaches detect no relevant images. By contrast,
the other two approaches detect Monsters Inc. from frame
144 to frame 146, but the proposed approach fails to detect
any relevant image. Fig. 14 presents two example frames to
illustrate the results. Note that these frames are not included
in the calculation of the PAO because the transition sections
of the videos are not included in the ground-truth files in the
Stanford MAR dataset.



Fig. 14. Transition frames from Wang Book to Monsters Inc. Left: frame
135, where the proposed approach correctly detects Wang Book and the other
two approaches fail. Right: frame 145, where the proposed approach fails to
detect the relevant image and the other two approaches successfully detect
Monsters Inc.


3) Retrieval results for 720p video sequences: The Stanford
MAR dataset used in the previous experiments is quite small
and simple. Therefore, in this retrieval experiment, we use two
720p format video sequences [28] that contain rich features
to evaluate our proposed approach: 720p50_mobcal_ter and
720p5994_stockholm_ter. Note that four frames of each video
are gray frames, which are removed before coding. We
displayed the first frame of each video on a monitor and acquired
an image of it using a mobile phone. The resulting images
therefore deviate considerably from the original as a result of
noise, perspective transformation, illumination changes, and
so on. The processed images, shown in Fig. 15, are used as
reference images and integrated into the database used in the
previous retrieval experiment to form the training database.



Fig. 15. Reference images derived from the first frames of each 720p video. 


As in the previous experiment, the 720p format videos are 
encoded using H.265/HEVC with a QP value of 46. The 
keypoints are extracted from the uncompressed video and 
encoded as side information. Because of the relatively large 
size of 720p videos, we extract 300 SIFT features from each 
frame and pass them to the retrieval engine used previously. 
The other settings are the same as those used in the previous 
retrieval experiment. The bitrates of the two encoded video 
sequences are 64.50 kbps and 41.74 kbps, and the bitrates of 
Fig. 16. The upper and lower plots show the keypoint bits for individual video frames and the pairwise matching performance between the video frame and
the top retrieved image, respectively.


the encoded keypoints are 9.20 kbps and 6.16 kbps,
respectively. We plot the number of matches between the correctly
retrieved image (i.e., the reference image is in the top position)
and the video frames for the uncompressed videos, the HEVC-
encoded videos and the videos encoded using our proposed
approach in Fig. 16 (only the results for the first 250 frames
are shown). It is apparent that using the encoded keypoints
significantly improves the retrieval performance. Note that
the content of the first frame gradually disappears in the
following frames, and therefore, the number of matches should
decrease through consecutive frames. However, in Fig. 16,
the number of matches obtained using our scheme increases
significantly. As noted before, the number of matches in
subsequent S-frames strongly relies on the number of matches
in the previous D-frame, as shown in the figure. In addition, we
can see that additional frames can be correctly retrieved with
the addition of side information, e.g., frames 175 to 200 in the
plot on the lower left and frames 95 to 139 in the plot on the
lower right. Compared with the results of the previous retrieval
experiment, it is more clearly demonstrated here that the
predicted keypoints yield correct descriptors calculated based
on these transition frames. It should be noted that the matches
identified by our proposed scheme significantly outnumber the
matches for the uncompressed video. This can be explained
as follows. Only 300 features are extracted from each 720p
video frame; therefore, they are highly sparse. Then, a few
even stronger features are detected from the coming scene,
which are identified as the top 300 features. The previous
scene is still present in the video content; thus, the predicted
keypoints can yield more valid descriptors than the scheme
using the uncompressed video because the first frame is the
reference image. Considering the resolution of 720p videos,
the scenario depicted in Fig. 16 can rapidly change if more
features are detected, i.e., if the number of features is sufficient
or if they are sufficiently distributed across the image. Note
that no N-frames are specified among the first 250 frames.
Fig. 17 shows two video frames for which only our approach
returned a correct match in the top position (see the lower
plots in Fig. 16). We can see that the contents of the query and
reference images (see Fig. 15) overlap to a large extent. Note
that the subsequent video frames still contain many features
that match with the relevant reference image; however, the
relevant reference frame is not returned in the top position.
Therefore, the numbers of matches are not shown in Fig. 16.



Fig. 17. Examples of retrieved transition frames. Top: frame 200 of 
720p50_mobcal_ter. Bottom: frame 139 of 720p5994_stockholm_ter 


C. Discussion 

1) Bitrate reduction for keypoints: We obtained our results 
using heuristically selected parameters based on the statistics 
for keypoint encoding. In the Intra mode procedure presented 
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in Section |V-A[ we quantize the locations using a factor 
of 1 and use 12 bits to encode the scales and orientations. 
Note that we can modify these parameters to achieve a larger 
bitrate reduction for D- and U-frames. In the Inter mode 
procedure presented in Section V-B the differential values 
of locations, scales, and orientations can be quantized using 
larger quantizers to further reduce the bitrate for U-frames. 
The threshold e (Section IV-D) determines whether the current
frame is designated as a U-frame. Therefore, this parameter 
strongly affects the length of a series of S-frames. A lower e 
value results in a larger number of S-frames for the current D- 
or U-frame, which yields a lower bitrate for keypoint encoding. 
Note that this value should not be too small because overly 
small values will affect the matching performance. The value 
of Ns (Section IV-E) is used to check the number of S-frames
associated with the current D- or U-frame. If the length of the 
window is too short, then the current D- or U-frame is changed 
to an N-frame to eliminate the encoding and transmission 
of keypoints in rapidly moving scenes. Furthermore, in our 
experiments, it was determined that the acceptable Ns value
ranges from 3 to 12, depending on the frame rate. In our 
previous work, we proposed removing spurious keypoints
and duplicated keypoints to further lower the number
of keypoints to be sent to the server. Note that a spurious 
keypoint for the current frame could be a useful keypoint for 
subsequent frames; therefore, we do not directly apply this 
approach to videos. However, removing duplicated keypoints 
with respect to the keypoints extracted from the compressed 
video frame is still feasible. To summarize, there is always 
a trade-off between the matching performance and the bitrate 
for keypoint encoding. Note that we present only the results 
obtained using the parameters selected in the previous sections, 
which appear reasonably effective for improving the matching 
performance. 
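
The role of Ns described above can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the function name, the single-letter frame labels, the strict "fewer than Ns" comparison, and the choice to switch off the dependent S-frames together with their D- or U-frame are all assumptions made for illustration.

```python
from typing import List


def demote_short_windows(frame_types: List[str], ns_min: int) -> List[str]:
    """Relabel D-/U-frames that are followed by fewer than ns_min S-frames.

    frame_types holds one label per frame: "D", "U", "S", or "N".
    """
    out = list(frame_types)
    i = 0
    while i < len(out):
        if out[i] in ("D", "U"):
            # Count the S-frames that directly follow this D- or U-frame.
            j = i + 1
            while j < len(out) and out[j] == "S":
                j += 1
            num_s = j - i - 1
            if num_s < ns_min:
                # Assumption: the S-frames that depend on this frame are
                # switched off as well, so the whole window becomes N-frames.
                for k in range(i, j):
                    out[k] = "N"
            i = j
        else:
            i += 1
    return out


labels = ["D", "S", "S", "S", "U", "S", "D", "S", "S", "S"]
print(demote_short_windows(labels, ns_min=3))
```

In this example, the U-frame is followed by only one S-frame, so its window is relabeled as N-frames, whereas both D-frames keep their three S-frames.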

2) Reusing previous descriptors: For S-frames or in the 
Skip mode for U-frames, the SIFT descriptors are calculated 
again at the server based on the estimated keypoints from the 
compressed frames in our experiments. The purpose of this
recalculation is to verify the correctness of the proposed keypoint
estimation approach. However, the descriptor calculation process
can be skipped by directly using the previous descriptors
at the server, i.e., the descriptors for the S-frames and some 
descriptors for the U-frames can be reused from the descriptors 
calculated for previous frames. Thus, the computation time 
at the server can be reduced. In our experiments, from the 
pairwise matching results shown in Fig. 11 and the retrieval 
results shown in Fig. it can be seen that the extraction of 
descriptors from predicted keypoints remains effective. 
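
A possible server-side reuse strategy is sketched below under stated assumptions. The keypoint identifiers that persist across frames, the cache layout, and the `compute` callback (a stand-in for the SIFT descriptor calculation) are hypothetical and were not part of our experiments, in which all descriptors were recalculated.

```python
from typing import Callable, Dict, Iterable, List

Descriptor = List[float]


def descriptors_for_frame(
    frame_type: str,
    keypoint_ids: Iterable[int],
    skip_ids: Iterable[int],
    cache: Dict[int, Descriptor],
    compute: Callable[[int], Descriptor],
) -> Dict[int, Descriptor]:
    """Return descriptors for one frame, reusing cached ones where allowed.

    S-frames reuse every cached descriptor; U-frames reuse only the
    descriptors of keypoints coded in the Skip mode; all other keypoints
    are passed to compute().
    """
    skip = set(skip_ids)
    result: Dict[int, Descriptor] = {}
    for kid in keypoint_ids:
        reusable = frame_type == "S" or (frame_type == "U" and kid in skip)
        if reusable and kid in cache:
            result[kid] = cache[kid]
        else:
            result[kid] = compute(kid)
    # Keep only the descriptors of the current frame for the next call.
    cache.clear()
    cache.update(result)
    return result
```

With such a cache, the descriptor calculation is invoked only for keypoints that are new or that were re-encoded, which is where the savings in server computation time would come from.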

3) Uploading frames and encoded keypoints: In certain 
applications, when a mobile device captures a video, it is 
unnecessary to upload all video frames. To address this 
scenario, 0 discusses two schemes, both including a Re¬ 
trieval State and a Pair-wise State. In the Retrieval State, the 
descriptors extracted from the frame are used to retrieve a 
relevant image. In the Pair-wise State, the descriptors from 
the current frame are compared with the previously retrieved 
image using the pairwise matching approach. The authors 
suggest that better bitrate reduction can be achieved using 


On-Device Tracking. In our proposed system, we can also
perform a process that is similar to On-Device Tracking.
When a new object is detected, the frame is HEVC Intra-encoded,
and the keypoints are encoded using the Intra mode
and sent as side information with the encoded frame. Note 
that this is similar to our previously proposed approach for 
images, which achieves improved performance at low
bitrates. The server then performs image retrieval and sends the 
relevant image back to the mobile device. Then, the descriptors 
from the following frames are compared with the retrieved 
image. If no new object is found, then there is no need for 
data transmission between the mobile device and the server. 

4) Computational complexity: It should be noted that the 
processes of keypoint detection, feature extraction and matching,
and keypoint encoding are all based on uncompressed
video frames. Therefore, in practice, we can run these processes
in parallel with the actual video compression, as shown
in Fig. 1, to speed up the processing. In addition, to reduce
computational complexity, we can replace the descriptor
extraction and descriptor matching processes with simpler 
detectors, descriptors, and matching procedures, e.g., we can 
heuristically determine which keypoints can be used in the 
next frame. This keypoint encoding process can be optimized 
to achieve a reduced computation time compared with the 
H.265/HEVC encoding time. Moreover, because the keypoint
decoding is separate from the decoding of the video, the 
encoded bitstreams need not be transmitted simultaneously 
with the video. In certain applications, they can be transmitted 
later when the communication channel is not busy to improve 
the matching performance. 
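
The parallel arrangement mentioned above can be sketched as follows. This is only an illustration: `compress_video` and `encode_keypoints` are hypothetical stand-ins that return dummy byte strings, not calls to an actual H.265/HEVC encoder or to our keypoint encoder, and the use of a thread pool is an assumption (a real encoder would typically run as a separate process or in native code).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence, Tuple


def compress_video(frames: Sequence[object]) -> bytes:
    # Hypothetical stand-in for the video encoder.
    return f"video[{len(frames)}]".encode()


def encode_keypoints(frames: Sequence[object]) -> bytes:
    # Hypothetical stand-in for keypoint detection, matching, and encoding.
    # It reads only the uncompressed frames, so it does not have to wait
    # for the video encoder.
    return f"keypoints[{len(frames)}]".encode()


def encode_in_parallel(frames: Sequence[object]) -> Tuple[bytes, bytes]:
    """Run both encoders concurrently and return (video, side information)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        video_future = pool.submit(compress_video, frames)
        side_future = pool.submit(encode_keypoints, frames)
        return video_future.result(), side_future.result()


video_bits, side_bits = encode_in_parallel([0, 1, 2])
print(video_bits, side_bits)
```

Because the two bitstreams are produced independently, the side information could also be stored and transmitted later, as discussed above.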

VII. Conclusion 

Because of the adverse effects of video compression on 
feature-matching performance, we propose to encode the original
SIFT keypoints from a video and transmit them along
with the compressed video to the server. In this paper, we 
introduce four different types of frames for keypoint encoding 
based on considerations regarding different behaviors in consecutive
video frames. Then, we propose methods of predicting
keypoints, quantizing the affine transform matrix, adopting an 
adaptive detection interval, and switching off the keypoint 
encoding when the scene is moving quickly to reduce the 
keypoint bitrate. We describe the Intra, Inter and Skip modes of 
encoding the keypoints. Finally, pairwise matching and image 
retrieval are performed. The results show that the proposed 
approach achieves improved performance in feature matching
at a given rate. The proposed feature-preserving video
compression approach is advantageous because a standard- 
compatible video can be watched or stored for future use, 
flexible feature types can be extracted, and the orientations and 
scales can be used for geometric verification. In addition, when 
more features (e.g., 500 features) must be transmitted, the 
increase in bitrate incurred by our proposed scheme is much 
smaller than that of other schemes. Moreover, other types of 
keypoints (e.g., SURF, MSER, and FAST) can
be similarly encoded for improved feature extraction using the 
proposed framework. 